XML Documents Clustering Using a Tensor Space Model
Abstract
The traditional Vector Space Model (VSM) cannot represent both the structure and the content of XML documents. This paper introduces a novel method that represents XML documents in a Tensor Space Model (TSM) and then uses that representation for clustering. Empirical analysis shows that the proposed method scales to large datasets, and that the factorized matrices it produces improve cluster quality by enriching the document representation with both structure and content information.
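The abstract's contrast between VSM and TSM can be illustrated with a small sketch. This is not the paper's method: it assumes a third-order document × path × term tensor stored sparsely as `(path, term)` counts per document, and it performs no factorization or clustering. The helper names `tensor_entries`, `content_only`, and `cosine` are hypothetical, and only element text (not tail text or attributes) is counted.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def tensor_entries(xml_text):
    """Count (path, term) pairs for one document.

    Assumption for illustration: each document is one slice of a
    document x path x term tensor, stored as a sparse Counter.
    """
    root = ET.fromstring(xml_text)
    counts = Counter()

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}"
        for term in (node.text or "").lower().split():
            counts[(path, term)] += 1
        for child in node:
            walk(child, path)

    walk(root, "")
    return counts


def content_only(entries):
    """Collapse the path mode, leaving a plain VSM-style term vector."""
    terms = Counter()
    for (_path, term), n in entries.items():
        terms[term] += n
    return terms


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Two documents with the same words placed under different elements.
doc_a = "<paper><title>tensor clustering</title><body>xml documents</body></paper>"
doc_b = "<paper><title>xml documents</title><body>tensor clustering</body></paper>"

ta, tb = tensor_entries(doc_a), tensor_entries(doc_b)
vsm_similarity = cosine(content_only(ta), content_only(tb))  # terms only
tsm_similarity = cosine(ta, tb)                              # path and term together
```

In this toy case the term-only vectors are identical, while the path-aware entries share nothing, which is the kind of structural difference a term-only representation cannot see.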
Similar resources
Metaheuristic Clustering of Persian XML Documents Based on Structural and Content Similarity
Given the growing number of XML documents, organizing them effectively in order to retrieve useful information is essential. One possible solution is to cluster XML documents in order to discover knowledge. A key issue in clustering XML documents is how to measure the similarity between them. Conventional clustering of text documents using a do...
Distance Dimension Reduction on QR Factorization for Efficient Clustering Semantic XML Document Using the QR Fuzzy C-Mean (QR-FCM)
The rapid growth of XML adoption has created the need for a proper representation of semi-structured documents, one that takes a document's semantic and structural information into account so as to support more precise document analysis. To analyze the information represented in XML documents efficiently, research on XML document clustering is actively in progress. The key issu...
Tensor Framework and Combined Symmetry for Hypertext Mining
We make a case here for using a tensor framework for hypertext mining. A tensor is a generalization of a vector, and the tensor framework discussed here is a generalization of the vector space model, which is widely used in the information retrieval and web mining literature. Most hypertext documents have an inherent internal tag structure and external link structure that render the desirable use of m...
Hcmx: an Efficient Hybrid Clustering Approach for Multi-version Xml Documents
To retrieve useful information from the large and growing number of XML documents on the web, effective management of XML documents is essential. One solution is to cluster XML documents to find knowledge that promotes effective information management and maintenance. In the real world, however, XML documents are dynamic in nature. In contrast to static XML documents, changes from one version of XML d...
Utilizing the Structure and Data Information for XML Document Clustering
This paper reports on the experiments and results of a clustering approach used in the INEX 2008 Document Mining Challenge. The clustering approach utilizes both the structure and the content information of the XML documents in the Wikipedia collection. The content of the XML documents is measured using the latent semantic kernel (LSK). A well-known problem with the construction of latent seman...
Publication date: 2011